Overview

The available data comprises records containing several types of measurements from motion sensors: Euler angles (roll, pitch and yaw) plus raw gyroscope, accelerometer and magnetometer readings.

The motion sensors were placed at four different locations: on the human subjects’ belt, arm and forearm, and on the dumbbell itself.

There is a column for each one of these measurements in the provided data-sets (*.csv files). We are going to explore these variables in more detail in the sections that follow.

The objective of this analysis is to classify each record in the testing data-set as belonging to one of the following classes (column classe, “exercise class”):

Class Name Description
A Sitting (Still)
B Sitting Down
C Standing (Still)
D Standing Up
E Walking

The human subjects are six different users identified by the variable user_name.

I believe this summarizes everything we need to know about the nature of the information present in the data-sets, but you can find more detailed information here.

Installing the Required Packages

You might need to install the following packages if you don’t already have them:

install.packages("plyr")
install.packages("dplyr")
install.packages("stringr")
install.packages("Amelia")
install.packages("xtable")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
install.packages("rpart")
install.packages("tictoc")
install.packages("reshape2")

Run the lines for whichever packages you don’t already have before running the remaining chunks in this notebook.
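If you’d rather not run all of those unconditionally, a small sketch like the following installs only what’s missing (the package list mirrors the chunk above; the `interactive()` guard keeps it from attempting installs in non-interactive sessions):

```r
# Install only the packages that are not already present.
required <- c("plyr", "dplyr", "stringr", "Amelia", "xtable", "ggplot2",
              "gridExtra", "caret", "randomForest", "e1071", "rpart",
              "tictoc", "reshape2")
missing_pkgs <- setdiff(required, rownames(installed.packages()))
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```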

Importing the Required Packages

Once the libraries are installed, they need to be loaded as follows:

suppressMessages(library(plyr, warn.conflicts = FALSE))          # Manipulating dataframes
suppressMessages(library(dplyr, warn.conflicts = FALSE))
suppressMessages(library(stringr))                               # String operations
suppressMessages(library(Amelia))                                # Missing data plots
suppressMessages(library(xtable))                                # Pretty printing dataframes
suppressMessages(library(ggplot2))                               # Plotting
suppressMessages(library(gridExtra))
suppressMessages(library(reshape2))
suppressMessages(library(caret))                                 # Machine learning
suppressMessages(library(randomForest))
suppressMessages(library(e1071))
suppressMessages(library(rpart))
suppressMessages(library(tictoc))                                # Benchmarking

Loading Data

Reading CSV Files

Let’s first load the data-sets:

training <- read.csv('../input/pml-training.csv', na.strings=c("NA", "NULL", ""), stringsAsFactors = F)
testing  <- read.csv('../input/pml-testing.csv', na.strings=c("NA", "NULL", ""), stringsAsFactors = F)
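As a quick, self-contained illustration of what the `na.strings` argument does here (toy CSV text, not the actual files): empty fields and the literal strings "NA"/"NULL" are all read in as missing values:

```r
# The "NULL" string and the empty field both become NA on read.
toy <- read.csv(text = "a,b,c\n1,NULL,\n2,3,4",
                na.strings = c("NA", "NULL", ""), stringsAsFactors = FALSE)
toy
#   a  b  c
# 1 1 NA NA
# 2 2  3  4
```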


I also want to create a data-set which has all records. There’s an odd thing about these data-sets though: the record ID (X) restarts from one in the testing data-set, suggesting that the training and testing files were not produced by simply splitting one indexed data-set. I’m going to add a new column, record_id, so we can distinguish training and testing records more easily:

training_rows <- nrow(training)
testing_rows <- nrow(testing)
all_rows <- training_rows + testing_rows

training$record_id <- 1:training_rows
testing$record_id <- (training_rows+1):all_rows
all <- rbind.fill(training, testing)
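rbind.fill() is what allows binding the two frames even though training carries classe while testing carries problem_id instead: columns absent from one frame are filled with NA. A base-R sketch of the same idea, on hypothetical toy frames:

```r
# Base-R sketch of what plyr::rbind.fill does: align the column sets,
# filling columns missing from either frame with NA.
rbind_fill_sketch <- function(a, b) {
  cols <- union(names(a), names(b))
  a[setdiff(cols, names(a))] <- NA
  b[setdiff(cols, names(b))] <- NA
  rbind(a[cols], b[cols])
}

train_toy <- data.frame(x = 1:2, classe = c("A", "B"))
test_toy  <- data.frame(x = 3:4, problem_id = 1:2)
combined  <- rbind_fill_sketch(train_toy, test_toy)
# rows 1-2 get problem_id = NA; rows 3-4 get classe = NA
```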

Record Counting

I’m going to introduce a couple of functions before going forward. Data frames show nicely in RStudio, but they look terrible in the HTML output, so I have created a user-defined function which renders data frames into HTML tables:

render_table_in_viewer_pane <- function(data, digits) {
  html <- print(xtable(data, digits = digits), type = "html", print.results=FALSE)
  temp <- tempfile(fileext = ".html")
  cat(html, file = temp)
  rstudioapi::viewer(temp)
}

render_table <- function(data, digits = 2) {
  print(xtable(data, digits = digits), type = "html")
}

I have also created a function for creating a summary of record count:

data_count <- function(training, testing) {
  result <-               data.frame(measurement = "training records", count = nrow(training))
  result <- rbind(result, data.frame(measurement = "testing records",  count = nrow(testing)))
  result <- rbind(result, data.frame(measurement = "total records",    count = nrow(training) + nrow(testing)))
  result <- rbind(result, data.frame(measurement = "variables",        count = length(names(training))))
  return(result)
}

Now we can pretty print the summary using this function:

render_table(data_count(training, testing))
measurement count
1 training records 19622
2 testing records 20
3 total records 19642
4 variables 161


render_table() will also render the HTML table in RStudio’s “Viewer” pane (the panel that should be visible on your right). You should be able to see it after executing this chunk.

Missing Data

Detecting Missing Data

It’s always a good idea to learn about missing values from the start, so I have created a function for this purpose. Another option would be the missingness-map functions in the “Amelia” package, but due to the large number of missing values (NAs) in this data, the missingness maps just don’t look good.

missing_summary <- function(data) {
  count <- data %>% summarise(across(everything(), ~ sum(is.na(.x))))
  output <- data.frame(variable = names(count), missing_count = t(count))
  rownames(output) <- 1:nrow(output)
  output <- output[output$missing_count > 0, ]
  return(output)
}
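For reference, the same per-column NA counts can be computed without dplyr; a base-R sketch on toy data:

```r
# Base-R equivalent: count NAs per column, keep columns with at least one.
missing_summary_base <- function(data) {
  counts <- colSums(is.na(data))
  counts <- counts[counts > 0]
  data.frame(variable = names(counts), missing_count = unname(counts))
}

toy <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6), z = 1:3)
missing_summary_base(toy)
#   variable missing_count
# 1        x             1
# 2        y             2
```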

missing <- missing_summary(all[, !names(all) %in% c("classe") ])
render_table(missing)
variable missing_count
12 kurtosis_roll_belt 19236
13 kurtosis_picth_belt 19236
14 kurtosis_yaw_belt 19236
15 skewness_roll_belt 19236
16 skewness_roll_belt.1 19236
17 skewness_yaw_belt 19236
18 max_roll_belt 19236
19 max_picth_belt 19236
20 max_yaw_belt 19236
21 min_roll_belt 19236
22 min_pitch_belt 19236
23 min_yaw_belt 19236
24 amplitude_roll_belt 19236
25 amplitude_pitch_belt 19236
26 amplitude_yaw_belt 19236
27 var_total_accel_belt 19236
28 avg_roll_belt 19236
29 stddev_roll_belt 19236
30 var_roll_belt 19236
31 avg_pitch_belt 19236
32 stddev_pitch_belt 19236
33 var_pitch_belt 19236
34 avg_yaw_belt 19236
35 stddev_yaw_belt 19236
36 var_yaw_belt 19236
50 var_accel_arm 19236
51 avg_roll_arm 19236
52 stddev_roll_arm 19236
53 var_roll_arm 19236
54 avg_pitch_arm 19236
55 stddev_pitch_arm 19236
56 var_pitch_arm 19236
57 avg_yaw_arm 19236
58 stddev_yaw_arm 19236
59 var_yaw_arm 19236
69 kurtosis_roll_arm 19236
70 kurtosis_picth_arm 19236
71 kurtosis_yaw_arm 19236
72 skewness_roll_arm 19236
73 skewness_pitch_arm 19236
74 skewness_yaw_arm 19236
75 max_roll_arm 19236
76 max_picth_arm 19236
77 max_yaw_arm 19236
78 min_roll_arm 19236
79 min_pitch_arm 19236
80 min_yaw_arm 19236
81 amplitude_roll_arm 19236
82 amplitude_pitch_arm 19236
83 amplitude_yaw_arm 19236
87 kurtosis_roll_dumbbell 19236
88 kurtosis_picth_dumbbell 19236
89 kurtosis_yaw_dumbbell 19236
90 skewness_roll_dumbbell 19236
91 skewness_pitch_dumbbell 19236
92 skewness_yaw_dumbbell 19236
93 max_roll_dumbbell 19236
94 max_picth_dumbbell 19236
95 max_yaw_dumbbell 19236
96 min_roll_dumbbell 19236
97 min_pitch_dumbbell 19236
98 min_yaw_dumbbell 19236
99 amplitude_roll_dumbbell 19236
100 amplitude_pitch_dumbbell 19236
101 amplitude_yaw_dumbbell 19236
103 var_accel_dumbbell 19236
104 avg_roll_dumbbell 19236
105 stddev_roll_dumbbell 19236
106 var_roll_dumbbell 19236
107 avg_pitch_dumbbell 19236
108 stddev_pitch_dumbbell 19236
109 var_pitch_dumbbell 19236
110 avg_yaw_dumbbell 19236
111 stddev_yaw_dumbbell 19236
112 var_yaw_dumbbell 19236
125 kurtosis_roll_forearm 19236
126 kurtosis_picth_forearm 19236
127 kurtosis_yaw_forearm 19236
128 skewness_roll_forearm 19236
129 skewness_pitch_forearm 19236
130 skewness_yaw_forearm 19236
131 max_roll_forearm 19236
132 max_picth_forearm 19236
133 max_yaw_forearm 19236
134 min_roll_forearm 19236
135 min_pitch_forearm 19236
136 min_yaw_forearm 19236
137 amplitude_roll_forearm 19236
138 amplitude_pitch_forearm 19236
139 amplitude_yaw_forearm 19236
141 var_accel_forearm 19236
142 avg_roll_forearm 19236
143 stddev_roll_forearm 19236
144 var_roll_forearm 19236
145 avg_pitch_forearm 19236
146 stddev_pitch_forearm 19236
147 var_pitch_forearm 19236
148 avg_yaw_forearm 19236
149 stddev_yaw_forearm 19236
150 var_yaw_forearm 19236
161 problem_id 19622


We want to do imputation whenever possible, but given the huge number of missing values in these columns (19216 out of 19622 records are NAs in the training data-set), the only sensible option is to remove them:

remove_columns <- function(data, columns) {
  data <- data[, ! names(data) %in% columns]
  return(data)
}

extract_data_by_record_id <- function(data_from, ids_from) {
  return(data_from[data_from$record_id %in% ids_from$record_id, ])
}

all <- remove_columns(all, missing$variable)
training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)

Now we can take a sample from the data and see what it looks like:

sample_data_frame <- function(data, size) {
  sample_index <- sample(1:nrow(data), size)
  return(data[sample_index, ])
}
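One caveat: sample() draws randomly, so the table below will differ between runs. Calling set.seed() first makes the sample reproducible:

```r
# Same function as above, self-contained for illustration.
sample_data_frame <- function(data, size) {
  data[sample(1:nrow(data), size), , drop = FALSE]
}

set.seed(42)                               # fixed seed => reproducible sample
a <- sample_data_frame(data.frame(id = 1:100), 5)
set.seed(42)
b <- sample_data_frame(data.frame(id = 1:100), 5)
identical(a$id, b$id)                      # TRUE: same seed, same rows
```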

sample <- sample_data_frame(training, 10)
render_table(sample)
X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z classe record_id
7004 7004 jeremy 1322673076 86744 30/11/2011 17:11 no 440 0.38 3.75 -88.50 5 -0.13 -0.03 -0.03 -13 5 45 39 634 -302 0.00 0.00 0.00 19 -3.48 1.59 -0.56 -57 151 -93 616 167 316 56.37 -51.89 -75.39 29 0.37 0.13 -0.34 -142 153 -196 -505 356 46.00 -163.00 44.20 -91.50 41 -0.18 1.03 -0.23 -98 -283 -269 -129 -612.00 -31.00 B 7004
10543 10543 carlitos 1323084310 520389 05/12/2011 11:25 no 381 2.06 5.97 -92.90 2 -0.03 -0.02 0.07 -13 2 19 6 599 -320 69.80 -29.00 22.60 18 -0.43 0.42 -0.41 80 48 -148 725 -64 232 -123.04 -22.37 25.22 4 -0.03 -0.11 0.07 -8 -34 9 -579 215 25.00 145.00 42.70 135.00 34 -0.55 2.73 0.38 -110 252 -190 -656 472.00 572.00 C 10543
13359 13359 charles 1322837917 856278 02/12/2011 14:58 no 131 119.00 16.20 -1.93 18 0.08 0.11 -0.15 -20 63 -162 23 595 -336 -44.40 -22.40 -14.80 16 0.72 -0.58 0.82 82 -105 -74 695 17 198 109.23 41.53 -22.38 5 -0.10 -0.18 -0.16 20 44 -11 -363 501 -49.00 -74.10 17.60 -164.00 52 -0.45 0.75 -0.20 -433 -266 -20 -474 -305.00 134.00 D 13359
9836 9836 eurico 1322489668 294676 28/11/2011 14:14 no 264 1.37 1.90 -88.20 2 0.05 0.00 -0.03 -6 3 23 27 567 -440 89.00 4.18 90.60 35 -3.60 1.38 -0.33 -222 254 27 -45 295 622 -92.18 -31.14 -54.79 7 0.39 -0.40 -0.07 -20 -52 -34 -497 119 281.00 177.00 20.00 -81.00 24 0.45 -3.84 -1.28 4 -118 207 139 -771.00 130.00 C 9836
7160 7160 jeremy 1322673083 874781 30/11/2011 17:11 no 446 1.33 5.46 -87.90 4 -0.06 0.02 0.02 -23 1 34 54 632 -317 0.00 0.00 0.00 35 -1.20 0.87 -0.26 -241 240 -54 -343 449 397 55.47 -49.01 -78.64 31 0.66 -0.26 -1.15 -143 160 -215 -539 288 91.00 148.00 6.88 134.00 46 1.67 -5.41 -0.75 160 411 -96 -587 685.00 866.00 B 7160
763 763 adelmo 1322832775 605371 02/12/2011 13:32 no 177 130.00 -39.80 166.00 19 0.08 0.11 -0.21 46 44 -174 182 587 -360 -74.80 21.50 105.00 24 0.47 -0.59 0.85 -220 -67 18 -137 192 663 47.95 -26.28 -101.29 19 0.16 0.08 -0.25 -49 87 -160 -571 260 2.00 0.00 0.00 0.00 13 0.16 1.03 0.21 76 94 37 659 -128.00 56.00 A 763
10485 10485 carlitos 1323084307 136427 05/12/2011 11:25 no 379 1.57 6.16 -92.90 3 -0.06 0.02 -0.05 -16 3 29 3 606 -305 66.30 -32.80 30.00 17 -0.48 0.48 -0.52 79 46 -138 714 -79 249 -120.75 -5.85 35.81 3 0.05 -0.16 -0.07 -2 -32 12 -590 195 12.00 143.00 37.10 136.00 32 0.14 3.08 0.56 -67 235 -191 -651 468.00 528.00 C 10485
79 79 carlitos 1323084235 72306 05/12/2011 11:23 no 15 1.14 7.31 -94.00 3 0.02 0.00 -0.03 -18 2 22 -7 602 -304 -131.00 17.90 -162.00 34 0.00 -0.02 0.02 -288 109 -125 -368 335 516 13.33 -70.85 -84.45 37 0.02 -0.02 -0.02 -235 48 -270 -555 294 -66.00 23.40 -63.60 -147.00 36 0.05 -0.02 -0.02 189 207 -217 -9 656.00 476.00 A 79
5853 5853 pedro 1323095009 941345 05/12/2011 14:23 no 80 122.00 25.90 -4.22 19 -0.43 -0.02 -0.41 -39 71 -167 2 587 -352 133.00 21.60 -38.60 48 1.20 -1.48 0.31 365 -130 -271 466 -115 -483 -99.07 -22.23 52.57 13 0.47 0.83 -0.92 -29 -110 66 276 -693 -149.00 118.00 43.10 122.00 38 -0.79 4.26 1.79 2 346 -145 -690 560.00 929.00 B 5853
5721 5721 pedro 1323095003 376329 05/12/2011 14:23 no 75 121.00 26.60 -4.47 18 -0.50 -0.03 -0.41 -42 72 -161 -1 588 -360 180.00 -43.10 26.40 33 -3.40 1.20 0.30 316 6 57 732 84 42 -93.96 -15.13 60.71 9 0.42 0.67 -0.66 -14 -75 53 323 -673 -153.00 -143.00 30.00 -88.00 35 -0.35 0.85 0.00 -60 -229 -242 -124 -623.00 -13.00 B 5721


Looking much better now. Let’s check for missing values once again:

render_table(missing_summary(training))
variable missing_count


No missing values to be found. Looking good.

We should do the same checking for the testing data-set as well:

render_table(missing_summary(testing))
variable missing_count
60 classe 20


Only classe is missing, which is ok since the testing data-set isn’t labeled.

Variables & Data Types

read.csv() will assign types to variables automatically. You can always check your data types by invoking str():

str(training)
## 'data.frame':    19622 obs. of  61 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name           : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1: int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2: int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp      : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window          : chr  "no" "no" "no" "no" ...
##  $ num_window          : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : chr  "A" "A" "A" "A" ...
##  $ record_id           : int  1 2 3 4 5 6 7 8 9 10 ...


Note that most variables are numeric (either integers or floats). The exceptions are user_name, cvtd_timestamp, new_window and classe (all strings).

str()’s output is a bit hard to read, so I’m going to group the variable names together according to where the motion sensor is placed:

matching_variables <- function(data, pattern) {
  return(names(data)[grep(pattern, names(data))])
}

variable_summary <- function(data, patterns) {
  result <- data.frame(pattern = character(0), variables = character(0))
  for(pattern in patterns) {
    variables <- paste(matching_variables(data, paste0(".*", pattern, ".*")), collapse = ", ")
    result <- rbind(result, data.frame(pattern = pattern, variables = variables))
  }
  return(result)
}

render_table(variable_summary(training, c("belt", "[^fore]+arm", "forearm", "dumbbell")))
pattern variables
1 belt roll_belt, pitch_belt, yaw_belt, total_accel_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_x, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y, magnet_belt_z
2 [^fore]+arm roll_arm, pitch_arm, yaw_arm, total_accel_arm, gyros_arm_x, gyros_arm_y, gyros_arm_z, accel_arm_x, accel_arm_y, accel_arm_z, magnet_arm_x, magnet_arm_y, magnet_arm_z
3 forearm roll_forearm, pitch_forearm, yaw_forearm, total_accel_forearm, gyros_forearm_x, gyros_forearm_y, gyros_forearm_z, accel_forearm_x, accel_forearm_y, accel_forearm_z, magnet_forearm_x, magnet_forearm_y, magnet_forearm_z
4 dumbbell roll_dumbbell, pitch_dumbbell, yaw_dumbbell, total_accel_dumbbell, gyros_dumbbell_x, gyros_dumbbell_y, gyros_dumbbell_z, accel_dumbbell_x, accel_dumbbell_y, accel_dumbbell_z, magnet_dumbbell_x, magnet_dumbbell_y, magnet_dumbbell_z


The function variable_summary() simply looks up and groups together the variables whose names match each given pattern.
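The `[^fore]+arm` pattern deserves a note: it requires at least one character outside the set {f, o, r, e} immediately before “arm”, which is what keeps the forearm variables out of the arm group:

```r
# "_arm" matches ("_" is outside [fore]); in "roll_forearm" every character
# immediately before "arm" belongs to [fore], so it does not match.
grepl("[^fore]+arm", c("roll_arm", "total_accel_arm", "roll_forearm"))
# [1]  TRUE  TRUE FALSE
```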

We will notice that these variable names are consistent with the experiment setup described in the overview section. To be completely honest, I did this variable exploration first and added more information to the overview section later. I also relied on information from the study that provided the data and from this document on motion sensors, but I believe you should rely more heavily on the data exploration itself.

A description for the remaining variables in the data-set follows:

Measurement Comment
X Record ID.
user_name Username of human subject being measured.
raw_timestamp_part_1 First part of the timestamp, which is broken into two variables (due to integer size restrictions in the apparatus, it seems).
raw_timestamp_part_2 Second part of the timestamp.
cvtd_timestamp Timestamp formatted as a string: “%d/%m/%Y %H:%M”.
new_window Is the window a new window: “yes” or “no”.
num_window Window ID.
classe “exercise class” with values A, B, C, D & E, as described in the overview section.

All measurements are performed inside a one-second time window:

Motion Sensor Time Window

Note that there is window overlap. In some of the scatter plots you will find multiple horizontal parallel lines, which are explained by multiple parallel variable measurements at any given timestamp.

I was a bit unsure about what the variable new_window represents, so I decided to investigate further and created a function that extracts the records for a given num_window sorted by timestamp:

extract_window <- function(data, num_window) {
  output <- data[data$num_window == num_window, ]
  output <- output[order(output$raw_timestamp_part_1, output$raw_timestamp_part_2), ]
  return(output[, names(output) %in% c("raw_timestamp_part_1", "raw_timestamp_part_2", "new_window", "num_window")])
}
  
render_table(extract_window(training, num_window = 12))
raw_timestamp_part_1 raw_timestamp_part_2 new_window num_window
4 1323084232 120339 no 12
5 1323084232 196328 no 12
6 1323084232 304277 no 12
7 1323084232 368296 no 12
8 1323084232 440390 no 12
9 1323084232 484323 no 12
10 1323084232 484434 no 12
11 1323084232 500302 no 12
12 1323084232 528316 no 12
13 1323084232 560359 no 12
14 1323084232 576390 no 12
15 1323084232 604281 no 12
16 1323084232 644302 no 12
17 1323084232 692324 no 12
18 1323084232 732306 no 12
19 1323084232 740353 no 12
20 1323084232 788335 no 12
21 1323084232 876301 no 12
22 1323084232 892313 no 12
23 1323084232 932285 no 12
24 1323084232 996313 yes 12


If you extract windows for different num_window values, you will find this very same pattern over and over. new_window seems to indicate that the measurement is the last within a given window and that num_window should be incremented for the next measurement.

Feature Engineering

Combining raw_timestamp_part_1 and raw_timestamp_part_2

I’m going to create some additional columns to make the data a bit easier to work with. For starters, there’s no need to keep the timestamp split across two columns (that’s a limitation of the motion sensor apparatus):

add_timestamp <- function(data) {
  data$raw_timestamp <- data$raw_timestamp_part_1 + data$raw_timestamp_part_2
  return(data)
}

all <- add_timestamp(all)
training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)

NOTE: When imputing missing values or creating new features involving categorical variables, I suggest always applying such transformations over the whole data (training + testing). Say you have a factor named Pie with levels Apple, Pecan, Cherry and Pumpkin. It could be the case that your training data-set only has entries with levels Apple and Pecan while your testing data-set has Cherry and Pumpkin. On factor conversion, Pie from training will have different levels from Pie from testing (incompatible), which will be a pain when you’re ready to train your classifier. Of course, you can always override the factor levels for both data-sets afterwards, but it’s a somewhat awkward operation.
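A toy illustration of that pitfall (the Pie variable is hypothetical):

```r
train_pie <- factor(c("Apple", "Pecan"))       # levels: Apple, Pecan
test_pie  <- factor(c("Cherry", "Pumpkin"))    # levels: Cherry, Pumpkin
# The two factors have incompatible level sets. Converting with the level
# set taken from the combined data fixes this:
all_levels <- levels(factor(c("Apple", "Pecan", "Cherry", "Pumpkin")))
train_pie  <- factor(c("Apple", "Pecan"), levels = all_levels)
test_pie   <- factor(c("Cherry", "Pumpkin"), levels = all_levels)
levels(train_pie)    # all four levels, now shared with test_pie
```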

Making classe More Descriptive

“A”, “B”, “C”, etc, are not very descriptive, so I’m going to add a column named classe_description:

add_classe_description <- function(data) {
  descriptions <-                     data.frame(classe = "A", classe_description = "Sitting (Still)")
  descriptions <- rbind(descriptions, data.frame(classe = "B", classe_description = "Sitting Down"))
  descriptions <- rbind(descriptions, data.frame(classe = "C", classe_description = "Standing (Still)"))
  descriptions <- rbind(descriptions, data.frame(classe = "D", classe_description = "Standing Up"))
  descriptions <- rbind(descriptions, data.frame(classe = "E", classe_description = "Walking"))
  descriptions$classe <- as.factor(descriptions$classe)
  data$classe <- as.factor(data$classe)
  return(left_join(data, descriptions, by = c("classe")))
}

all <- add_classe_description(all)
training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)

Looking Up Patterns

Before we start choosing features for all models, let’s begin with graphical exploration. In the next sections we will develop some tools for this purpose.

Plotting Tools

These are some custom plotting functions I use in my notebooks. I find it useful to extract plotting code into functions for reusability’s sake:

custom_density_plot <- function(data, x_column, color_column) {
  ggplot(data, aes_string(x = x_column, color = color_column)) + 
  geom_density(na.rm = TRUE)
}

custom_violin_plot <- function(data, x_column, color_column) {
  ggplot(data, aes_string(color_column, x_column)) + 
  geom_violin(aes_string(fill = color_column)) + 
  geom_point(position = position_jitter(width = 0.2), color = "blue", alpha = 0.2)
}

custom_scatter_plot <- function(data, x_column, y_column, color_column) {
  ggplot(data, aes_string(x = x_column, y = y_column, color = color_column)) + 
  geom_point()
}

custom_count_barchart <- function(data, column, title) {
   ggplot(data, aes_string(x=column)) + 
   geom_bar(fill = "blue", alpha = 0.2) +
   geom_text(stat='count', aes(label=sprintf("%d%% (%d)", round(..count.. * 100/sum(..count..), 0), ..count..), vjust=0)) +
   xlab(paste(column, "(", title, ")"))
}

correlation_matrix <- function(data) {
  numeric_data <- data[, sapply(data,is.numeric)]
  matrix <- round(cor(numeric_data), 2)
  matrix[upper.tri(matrix)] <- NA
  matrix <- melt(matrix, na.rm = TRUE)
  return(matrix)
}

custom_correlation_heat_map <- function(data) {
  matrix <- correlation_matrix(data)
  ggplot(data = matrix, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name="Pearson Correlation") +
  theme(axis.text.x = element_text(angle = 90)) +
  coord_fixed()
}
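The core of correlation_matrix() — keep the numeric columns, correlate, blank out the upper triangle — can be verified on toy data without any plotting:

```r
toy <- data.frame(x = 1:10, y = (1:10) * 2, label = letters[1:10])
numeric_toy <- toy[, sapply(toy, is.numeric)]   # drops the "label" column
m <- round(cor(numeric_toy), 2)
m[upper.tri(m)] <- NA                           # keep the lower triangle only
m
#   x  y
# x 1 NA
# y 1  1
```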

Their names should be descriptive enough for you to deduce what they do. Let’s take a look at the scatter plots first.

Graphical Exploration

I believe it makes more sense to break the data down per subject (user_name) rather than mix data from different human subjects:

jeremy_training <- training[training$user_name == "jeremy", ]
charles_training <- training[training$user_name == "charles", ]

jeremy_scatter <-  custom_scatter_plot(training[training$user_name == "jeremy", ], "raw_timestamp", "total_accel_belt", "classe_description")
charles_scatter <- custom_scatter_plot(training[training$user_name == "charles", ], "raw_timestamp", "total_accel_belt", "classe_description")
grid.arrange(jeremy_scatter, charles_scatter, ncol = 2)

I had a somewhat naive expectation that I would be able to see some semblance of a continuous curve (these are time series, with raw_timestamp on the x axis), but the sensors don’t seem that refined and appear to provide essentially discrete measurements.

From the scatter plots alone, you get the impression that total_accel_belt would not be a great predictor, given that its values seem to almost fully overlap across all categories of classe (we can’t see patches of different colors, each representing a different classe).

You will see that this is not the case, though: the variable is a fairly good predictor (if not as good as the remaining ones). Scatter plots are simply not the best tool for finding patterns in this particular data. Despite that, I’m going to plot another variable (total_accel_forearm, which seems to present the least amount of overlap) just to show a better case:

jeremy_scatter <-  custom_scatter_plot(jeremy_training, "raw_timestamp", "total_accel_forearm", "classe_description")
charles_scatter <- custom_scatter_plot(charles_training, "raw_timestamp", "total_accel_forearm", "classe_description")

grid.arrange(jeremy_scatter, charles_scatter, ncol = 2)

Given that these measurements are not as smooth as I was expecting and we can’t easily identify any patterns, density plots might be more suitable for the job. I’m also going to adopt “jeremy” as the user for plot analysis from this point on:

density_total_accel_forearm <- custom_density_plot(jeremy_training, "total_accel_forearm", "classe_description")
violin_total_accel_forearm <- custom_violin_plot(jeremy_training, "total_accel_forearm", "classe_description")

grid.arrange(density_total_accel_forearm, violin_total_accel_forearm, ncol = 2)


You should be familiar with density plots, but a good alternative is the violin plot, which can be seen as an “augmented box plot”. As in box plots, the region outside the violin represents the outliers, while frequent values are kept inside. In a violin plot the width changes according to how frequent (or dense) a value is, whereas in a box plot the box’s width is constant (the box is square).

I feel that violin plots are better visual tools than density plots, so I’m going to use them from now on:

violin_total_accel_arm      <- custom_violin_plot(jeremy_training, "total_accel_arm", "classe_description")
violin_total_accel_belt     <- custom_violin_plot(jeremy_training, "total_accel_belt", "classe_description")
violin_total_accel_dumbbell <- custom_violin_plot(jeremy_training, "total_accel_dumbbell", "classe_description")

grid.arrange(violin_total_accel_forearm, violin_total_accel_arm, violin_total_accel_belt, violin_total_accel_dumbbell, ncol = 2)


I have plotted only four variables, one variable per violin plot. You might have noticed that the violin shapes change quite a bit as we shift from one classe to another within a plot. We are going to take advantage of these differences in data-density patterns to predict classe for the testing data-set.

Correlation Between Variables

I’m going to use the correlation heat map we introduced in the plotting utilities section:

custom_correlation_heat_map(training)


I must remind you that correlation matrices only work with numeric variables.

Notice that some high correlations are obvious, like total_accel_dumbbell with accel_dumbbell_x, accel_dumbbell_y and accel_dumbbell_z (its components).

You will also notice that the motion of the dumbbell is highly correlated with the motion of the forearm, which also makes sense.

In general, most measurements seem uncorrelated (the heat map is quite sparse), which is a good thing: it means most variables are not superfluous. Therefore I don’t see a reason to do principal component analysis; I don’t think we would get much benefit from it.

Choosing Predictors

Some of the fields just won’t make good predictors, like the timestamps (raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp) and the window variables (num_window and new_window): they obviously don’t correlate with classe (they are timestamps, indexes and control variables), so we are going to remove them:

not_correlated_columns <- c("raw_timestamp_part_1", "raw_timestamp_part_2", "raw_timestamp", "cvtd_timestamp", "num_window", "new_window", "classe_description", "user_name")
training <- remove_columns(training, not_correlated_columns)
testing <- remove_columns(testing, not_correlated_columns)
all <- rbind.fill(training, testing)

I’m removing user_name as well. We don’t want this variable: the plan is to identify motion patterns that generalize to any individual, not motion patterns for particular individuals.

classe_description was just useful for analysis, so we will drop it as well.

A last step before we begin classification is converting classe to factor:

all$classe <- as.factor(all$classe)

training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)

That’s the last transformation required, so we can remove record_id:

training <- remove_columns(training, c("record_id"))
testing <- remove_columns(testing, c("record_id"))

Classification

In this section I’m going to try a few different classifiers on the available data and compare how they perform, using accuracy as a metric. This metric can be extracted from the confusion matrix, so I have created the following function for that purpose:

accuracy <- function(confusion) return(confusion$overall["Accuracy"])
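To make it concrete, accuracy is just the proportion of predictions that match the reference labels. A base-R sketch (the toy vectors are mine, purely for illustration):

```r
# Toy predictions vs. true labels: 3 out of 4 match.
predicted <- factor(c("A", "B", "A", "C"))
reference <- factor(c("A", "B", "B", "C"))

# Equivalent to confusion$overall["Accuracy"] from caret's confusionMatrix():
manual_accuracy <- mean(predicted == reference)
manual_accuracy  # 0.75
```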

Model Building

One thing you will notice with machine learning packages is that they don’t quite follow a common interface. For instance, when getting a prediction with the “randomForest” package, type should be set to “class”, while for the package “party”, type should be set to “response”. I want to avoid copy-and-paste programming as much as possible, so I want the functions that evaluate models and compute their accuracy to work across packages. For this reason, I have defined a common interface for all models:

function description
build(formula, data) Builds a model from a formula and training data.
predictions(model, data) Generate predictions for the input data using the model.

R doesn’t quite support OOP, but you can emulate some OOP principles, like encapsulation, so I have encapsulated build() and predictions() inside an object. I have also included the attribute name, so we can easily identify the model.

In short, what you need to do is create a function that follows this template:

my_model_builder <- function() {
  
  name <- "<name of my model builder>"
  
  build <- function(formula, data) "<my model builder code>"

  predictions <- function(model, data) "<my predictions function>"

  return(list(name=name, build=build, predictions=predictions))
}

In the following chunk, I have created builders for a number of different packages:

SEED <- 345

random_forest_builder <- function() {

  name <- "Random Forest (randomForest package)"
  
  build <- function(formula, data) return(randomForest(formula, data = data))
  
  predictions <- function(model, data) return(predict(model, newdata = data, type = "class"))

  return(list(name=name, build=build, predictions=predictions))
}

svm_builder <- function() {
  
  name <- "SVM (e1071 package)"
  
  build <- function(formula, data) return(svm(formula, data = data, probability=TRUE))
  
  predictions <- function(model, data) return(predict(model, newdata = data, type = "class"))
  
  return(list(name=name, build=build, predictions=predictions))
}

decision_tree_builder <- function() {
  
  name <- "Decision Tree (rpart package)"
  
  build <- function(formula, data) return(rpart(formula, data = data, method="class"))

  predictions <- function(model, data) return(predict(model, newdata = data, type = "class"))

  return(list(name=name, build=build, predictions=predictions))
}

random_forest_k_fold_cross_validation_builder <- function() {
  name <- "Random Forest with K-Fold Cross Validation (caret package)"
  
  build <- function(formula, data) {
    control <- trainControl(method = "cv", number = 5)
    return(train(formula, data = data,
                 method = "rf",  trControl = control,
                 metric="Accuracy", importance = TRUE))
  }
  
  predictions <- function(model, data) return(predict(model, newdata = data, type = "raw"))

  return(list(name=name, build=build, predictions=predictions))
}
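As an example of how another package would slot into this interface, here is what a builder for party’s ctree (the package mentioned earlier for its type = "response" quirk) might look like. This is my own sketch, not part of the original benchmark; the namespace-qualified call is an assumption about how one would wire it up:

```r
ctree_builder <- function() {

  name <- "Conditional Inference Tree (party package)"

  # party::ctree fits a conditional inference tree from a formula.
  build <- function(formula, data) return(party::ctree(formula, data = data))

  # For party models, predict() expects type = "response" rather than "class".
  predictions <- function(model, data) return(predict(model, newdata = data, type = "response"))

  return(list(name=name, build=build, predictions=predictions))
}
```

It could then be appended to the vector of builders passed to evaluate_models() without touching the evaluation code.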

Model Evaluation

Evaluation Tools

Now we can implement a general function that can evaluate any model from any package as long as we build a builder object that follows this common interface:

wrap_string <- function(string) {
  return(paste(strwrap(string, 50), collapse="\n"))
}

formula_name <- function(formula) {
  name <- paste(format(formula), collapse = "")
  name <- wrap_string(name)
  return(name)
}

empty_evaluation <- function() {
  empty <- data.frame(model             = character(), 
                      formula           = character(), 
                      accuracy          = numeric(0),
                      error_rate        = numeric(0),
                      elapsed_time_secs = numeric(0),
                      stringsAsFactors=FALSE)
  return(empty)
}

evaluate_model <- function(formula, model_builder, prediction_column, train_data, cross_validation_data) {
  tic()
  builder <- model_builder()
  model <- builder$build(formula, train_data)
  predictions <- builder$predictions(model, cross_validation_data)
  confusion <- confusionMatrix(data = predictions, reference = cross_validation_data[, prediction_column])
  garbage <- capture.output(exec_time <- toc())
  evaluation <- data.frame(model             = builder$name, 
                           formula           = formula_name(formula),
                           accuracy          = accuracy(confusion), 
                           error_rate        = 1 - accuracy(confusion),
                           elapsed_time_secs = (exec_time$toc - exec_time$tic),
                           stringsAsFactors=FALSE)
  return(evaluation)
}

evaluate_models <- function(formulas, model_builders, prediction_column, train_data, cross_validation_data) {
  evaluations <- empty_evaluation()

  for (model_builder in model_builders) {
    for (formula in formulas) {
      set.seed(SEED)
      evaluation <- evaluate_model(formula, model_builder, prediction_column, train_data, cross_validation_data)
      evaluations <- bind_rows(evaluations, evaluation)
    }
  }
  render_table(evaluations, digits = 6)
}

Armed with this evaluation function, I can easily evaluate new models without having to change much code. All I have to do is create a builder object and add it to the list of models passed to evaluate_models(); I don’t need to modify the evaluation function in any way.

If you are not a programmer, this code design practice is called the open/closed principle.

Evaluating Different Model Types

Let’s evaluate a few models now. We need X to identify each record when generating the output, but X has no value as a predictor, and therefore it has been excluded from the formula:

set.seed(SEED)
train_idx  <- createDataPartition(training$classe, p = 0.6, list = FALSE, times = 1)
train_data <- training[train_idx, ]
cross_validation_data  <- training[-train_idx, ]

formulas <- c(classe ~ . - X)
models <- c(random_forest_builder, svm_builder, decision_tree_builder, random_forest_k_fold_cross_validation_builder)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
model formula accuracy error_rate elapsed_time_secs
1 Random Forest (randomForest package) classe ~ . - X 0.995284 0.004716 78.904000
2 SVM (e1071 package) classe ~ . - X 0.939969 0.060031 174.169000
3 Decision Tree (rpart package) classe ~ . - X 0.725593 0.274407 3.902000
4 Random Forest with K-Fold Cross Validation (caret package) classe ~ . - X 0.992735 0.007265 2006.722000


Notice that I’m breaking the original training data-set into a new training set (train_data, let’s call it “sub-training”) and a cross-validation data-set. The error rate shown in the output is a good estimate of the out-of-sample error on the testing data-set, given that it is calculated over cross-validation data, which doesn’t share any records with the sub-training data-set.

You might notice that I have a single evaluation function that can evaluate any model from any package (as long as I create a model builder object that follows the pre-defined interface). Also note that the split into train_data and cross_validation_data is necessary because the provided testing data-set isn’t labeled.

From the evaluation results, random forest (from the “randomForest” package) performed the best. I have also used random forest with k-fold cross validation (“k” equal to 5), which took over half an hour to complete and performed worse than the first. I guess it could be improved by increasing k, but that would make the creation of the classifier even slower. It seems like the default values of ntree (500) and mtry for the “randomForest” trainer do a good enough job.

Evaluating with Subsets of Features

This experiment makes use of a range of different sensors (roll, pitch, yaw, acceleration, etc.) located in different parts of the human subject’s body (arm, forearm, belt) and on the dumbbell.

I was wondering about how much accuracy I could get from these sensors separately (for different body parts or types of measurement). That’s an interesting question: Do we really need that many sensors?

The random forest (package randomForest) seems quite fast, so I’m going to repeat the evaluation for different combinations of features.

Features Grouped by Body Part

belt_formula <- classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+
                  gyros_belt_x+gyros_belt_y+gyros_belt_z+
                  accel_belt_x+accel_belt_y+accel_belt_z+
                  magnet_belt_x+magnet_belt_y+magnet_belt_z

arm_formula <- classe~roll_arm+pitch_arm+yaw_arm+total_accel_arm+
                 gyros_arm_x+gyros_arm_y+gyros_arm_z+
                 accel_arm_x+accel_arm_y+accel_arm_z+
                 magnet_arm_x+magnet_arm_y+magnet_arm_z

forearm_formula <- classe~roll_forearm+pitch_forearm+yaw_forearm+total_accel_forearm+
                     gyros_forearm_x+gyros_forearm_y+gyros_forearm_z+
                     accel_forearm_x+accel_forearm_y+accel_forearm_z+
                     magnet_forearm_x+magnet_forearm_y+magnet_forearm_z

dumbbell_formula <- classe~roll_dumbbell+pitch_dumbbell+yaw_dumbbell+total_accel_dumbbell+
                      gyros_dumbbell_x+gyros_dumbbell_y+gyros_dumbbell_z+
                      accel_dumbbell_x+accel_dumbbell_y+accel_dumbbell_z+
                      magnet_dumbbell_x+magnet_dumbbell_y+magnet_dumbbell_z
formulas <- c(belt_formula, arm_formula, forearm_formula, dumbbell_formula)
models <- c(random_forest_builder)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
model formula accuracy error_rate elapsed_time_secs
1 Random Forest (randomForest package) classe ~ roll_belt + pitch_belt + yaw_belt + total_accel_belt + gyros_belt_x + gyros_belt_y + gyros_belt_z + accel_belt_x + accel_belt_y + accel_belt_z + magnet_belt_x + magnet_belt_y + magnet_belt_z 0.905430 0.094570 23.291000
2 Random Forest (randomForest package) classe ~ roll_arm + pitch_arm + yaw_arm + total_accel_arm + gyros_arm_x + gyros_arm_y + gyros_arm_z + accel_arm_x + accel_arm_y + accel_arm_z + magnet_arm_x + magnet_arm_y + magnet_arm_z 0.861840 0.138160 23.158000
3 Random Forest (randomForest package) classe ~ roll_forearm + pitch_forearm + yaw_forearm + total_accel_forearm + gyros_forearm_x + gyros_forearm_y + gyros_forearm_z + accel_forearm_x + accel_forearm_y + accel_forearm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z 0.895106 0.104894 23.593000
4 Random Forest (randomForest package) classe ~ roll_dumbbell + pitch_dumbbell + yaw_dumbbell + total_accel_dumbbell + gyros_dumbbell_x + gyros_dumbbell_y + gyros_dumbbell_z + accel_dumbbell_x + accel_dumbbell_y + accel_dumbbell_z + magnet_dumbbell_x + magnet_dumbbell_y + magnet_dumbbell_z 0.885419 0.114581 23.252000


Features Grouped by Type of Measurement

roll_pitch_yaw_formula <- classe~roll_belt+pitch_belt+yaw_belt+
                            roll_arm+pitch_arm+yaw_arm+
                            roll_forearm+pitch_forearm+yaw_forearm+
                            roll_dumbbell+pitch_dumbbell+yaw_dumbbell

total_accel_formula <- classe~total_accel_belt+total_accel_arm+total_accel_forearm+total_accel_dumbbell

gyros_formula <- classe~gyros_belt_x+gyros_belt_y+gyros_belt_z+
                   gyros_arm_x+gyros_arm_y+gyros_arm_z+
                   gyros_forearm_x+gyros_forearm_y+gyros_forearm_z+
                   gyros_dumbbell_x+gyros_dumbbell_y+gyros_dumbbell_z
                 
magnet_formula <- classe~magnet_belt_x+magnet_belt_y+magnet_belt_z+
                    magnet_arm_x+magnet_arm_y+magnet_arm_z+
                    magnet_forearm_x+magnet_forearm_y+magnet_forearm_z+
                    magnet_dumbbell_x+magnet_dumbbell_y+magnet_dumbbell_z 

accel_formula <- classe~accel_belt_x+accel_belt_y+accel_belt_z+
                   accel_arm_x+accel_arm_y+accel_arm_z+
                   accel_forearm_x+accel_forearm_y+accel_forearm_z+
                   accel_dumbbell_x+accel_dumbbell_y+accel_dumbbell_z
formulas <- c(roll_pitch_yaw_formula, total_accel_formula, gyros_formula, magnet_formula, accel_formula)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
model formula accuracy error_rate elapsed_time_secs
1 Random Forest (randomForest package) classe ~ roll_belt + pitch_belt + yaw_belt + roll_arm + pitch_arm + yaw_arm + roll_forearm + pitch_forearm + yaw_forearm + roll_dumbbell + pitch_dumbbell + yaw_dumbbell 0.985725 0.014275 19.548000
2 Random Forest (randomForest package) classe ~ total_accel_belt + total_accel_arm + total_accel_forearm + total_accel_dumbbell 0.675631 0.324369 13.123000
3 Random Forest (randomForest package) classe ~ gyros_belt_x + gyros_belt_y + gyros_belt_z + gyros_arm_x + gyros_arm_y + gyros_arm_z + gyros_forearm_x + gyros_forearm_y + gyros_forearm_z + gyros_dumbbell_x + gyros_dumbbell_y + gyros_dumbbell_z 0.881723 0.118277 27.022000
4 Random Forest (randomForest package) classe ~ magnet_belt_x + magnet_belt_y + magnet_belt_z + magnet_arm_x + magnet_arm_y + magnet_arm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z + magnet_dumbbell_x + magnet_dumbbell_y + magnet_dumbbell_z 0.955774 0.044226 21.167000
5 Random Forest (randomForest package) classe ~ accel_belt_x + accel_belt_y + accel_belt_z + accel_arm_x + accel_arm_y + accel_arm_z + accel_forearm_x + accel_forearm_y + accel_forearm_z + accel_dumbbell_x + accel_dumbbell_y + accel_dumbbell_z 0.938440 0.061560 22.220000

Combining Features from Different Groups

belt_forearm_formula <- classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+
                          gyros_belt_x+gyros_belt_y+gyros_belt_z+
                          accel_belt_x+accel_belt_y+accel_belt_z+
                          magnet_belt_x+magnet_belt_y+magnet_belt_z+
                          roll_forearm+pitch_forearm+yaw_forearm+total_accel_forearm+
                          gyros_forearm_x+gyros_forearm_y+gyros_forearm_z+
                          accel_forearm_x+accel_forearm_y+accel_forearm_z+
                          magnet_forearm_x+magnet_forearm_y+magnet_forearm_z

roll_pitch_yaw_magnet_formula <- classe~roll_belt+pitch_belt+yaw_belt+
                                   roll_arm+pitch_arm+yaw_arm+
                                   roll_forearm+pitch_forearm+yaw_forearm+
                                   roll_dumbbell+pitch_dumbbell+yaw_dumbbell+
                                   magnet_belt_x+magnet_belt_y+magnet_belt_z+
                                   magnet_arm_x+magnet_arm_y+magnet_arm_z+
                                   magnet_forearm_x+magnet_forearm_y+magnet_forearm_z+
                                   magnet_dumbbell_x+magnet_dumbbell_y+magnet_dumbbell_z
formulas <- c(belt_forearm_formula, roll_pitch_yaw_magnet_formula)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
model formula accuracy error_rate elapsed_time_secs
1 Random Forest (randomForest package) classe ~ roll_belt + pitch_belt + yaw_belt + total_accel_belt + gyros_belt_x + gyros_belt_y + gyros_belt_z + accel_belt_x + accel_belt_y + accel_belt_z + magnet_belt_x + magnet_belt_y + magnet_belt_z + roll_forearm + pitch_forearm + yaw_forearm + total_accel_forearm + gyros_forearm_x + gyros_forearm_y + gyros_forearm_z + accel_forearm_x + accel_forearm_y + accel_forearm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z 0.987764 0.012236 40.589000
2 Random Forest (randomForest package) classe ~ roll_belt + pitch_belt + yaw_belt + roll_arm + pitch_arm + yaw_arm + roll_forearm + pitch_forearm + yaw_forearm + roll_dumbbell + pitch_dumbbell + yaw_dumbbell + magnet_belt_x + magnet_belt_y + magnet_belt_z + magnet_arm_x + magnet_arm_y + magnet_arm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z + magnet_dumbbell_x + magnet_dumbbell_y + magnet_dumbbell_z 0.993500 0.006500 34.357000


Evaluation Aftermath

Not surprisingly, total_accel_formula didn’t fare well: taking the vector magnitude removes spatial information from the acceleration, which is why accel_formula did much better.

roll_pitch_yaw_magnet_formula (combining roll, pitch, yaw and magnetic measurements) seems to do almost as well at predicting classe as the full formula, with an accuracy of nearly 0.99.

belt_forearm_formula (combining belt and forearm measurements) also has accuracy close to 0.99.

If we were to put an actual biometric fitness product on the market (and I guess that was one of the motivations of the research), I believe we would try to simplify the design as much as possible.

There’s also the matter of convenience: I’m guessing that most people are probably not willing to wear this many sensors.

Feature Importance

builder <- random_forest_builder()
model <- builder$build(classe ~ . - X, training)
varImpPlot(model, type=2)


You may have noticed that the best predictors are roll, pitch, yaw and magnetic measurements, as we have pointed out in the previous section when evaluating models with different combinations of features.
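Instead of reading the top predictors off the plot, a small helper can rank them. This is my own sketch: with the fitted model above you would pass something like randomForest::importance(model)[, "MeanDecreaseGini"] as the input, while the scores below are made up purely to illustrate the helper:

```r
# Rank a named vector of importance scores and keep the top n predictors.
top_predictors <- function(importance_scores, n = 3) {
  ranked <- sort(importance_scores, decreasing = TRUE)
  names(ranked)[seq_len(min(n, length(ranked)))]
}

# Made-up scores, for illustration only:
toy_scores <- c(roll_belt = 950, yaw_belt = 610, magnet_dumbbell_z = 580, gyros_arm_x = 120)
top_predictors(toy_scores)  # "roll_belt" "yaw_belt" "magnet_dumbbell_z"
```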

Predictions

Let’s generate an output file with our predictions now:

remove_non_alphanumeric <- function(string) {
  return(str_replace_all(string, "[^\\w]", ""))
}

output_name <- function(formula, builder_name) {
  output <- paste0(builder_name, formula_name(formula))
  output <- remove_non_alphanumeric(output)
  output <- paste0(output, ".csv")
  return(output)
}

generate_predictions <- function(formula, model_builder, prediction_column, index_column, train_data, input_data) {
  builder <- model_builder()
  model <- builder$build(formula, train_data)
  predictions <- builder$predictions(model, input_data)
  input_data[, prediction_column] <- predictions
  output <- input_data[, names(input_data) %in% c(index_column, prediction_column)]
  write.csv(output, file = output_name(formula, builder$name), row.names = FALSE)
  return(input_data)
}

output <- generate_predictions(classe ~ . - X, random_forest_builder, "classe", "X", training, testing)
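As a side note on the naming scheme: since remove_non_alphanumeric() drops every non-word character, the file name is just the builder name and formula mashed together. A base-R sketch of the same logic (gsub standing in for stringr::str_replace_all; the inputs mirror the call above):

```r
# Base-R equivalent of output_name(): strip non-word characters and append ".csv".
base_output_name <- function(formula_text, builder_name) {
  paste0(gsub("[^[:alnum:]_]", "", paste0(builder_name, formula_text)), ".csv")
}

base_output_name("classe ~ . - X", "Random Forest (randomForest package)")
# "RandomForestrandomForestpackageclasseX.csv"
```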

Let’s compare the counts for the training and testing data-set as a sanity check:

training_count <- custom_count_barchart(training, "classe", "training")
testing_count <- custom_count_barchart(output, "classe", "testing")

grid.arrange(training_count, testing_count, ncol = 2)


I was expecting similar counts (percentages), but it’s not necessarily the case that the training and testing data-sets were randomly split from a single whole data-set. If you recall, the X values (record IDs) in the testing data-set overlap with the ones in the training data-set, suggesting that they were generated by independent experiments with the same setup.
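That overlap is easy to check. A minimal sketch (the helper and toy IDs are mine; in the notebook you would pass training$X and testing$X):

```r
# Count how many record IDs two data-sets share.
id_overlap <- function(ids_a, ids_b) length(intersect(ids_a, ids_b))

# Toy IDs: all 20 testing IDs fall inside the training ID range.
training_ids <- 1:100
testing_ids  <- 1:20
id_overlap(training_ids, testing_ids)  # 20
```

A non-zero overlap on independently generated data-sets is what suggests the IDs were assigned per experiment rather than across a single combined data-set.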

Final Words

An interesting thing to notice about our data is the number of measurements taken (accelerations, angular velocities, etc.). One would think that tracking a single variable over time (position, for instance) would be enough to deduce classe.

Also, looking at the scatter plots you will notice that the data is not that smooth; it’s quite noisy.

In practice, measurement devices are never perfect, and the best ones can be quite expensive. You get more bang for your buck by using many inexpensive, low-quality devices than a few expensive, high-quality ones. Even though the quality of our measurements is not that great, we can compensate by taking more measurements (from different sources) and still do very well at classification.